<html>
<head>
<link rel="shortcut icon" href="./favicon.ico">
<link rel="stylesheet" type="text/css" href="./style.css">
<link rel="canonical" href="./Multiplier_Binary_Parallel.html">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="A generic signed multiplier module, with the inference left to the CAD tool. Any attributes to control the synthesis of the multiplier should be applied, in the text of the enclosing module, to this whole module. The attributes vary too much and none provide the default automatic inference choices made by the CAD tool (e.g.: logic for narrow widths, DSP blocks for larger widths), which are almost always the right choices.">
<title>Multiplier Binary Parallel</title>
</head>
<body>

<p class="inline bordered"><b><a href="./Multiplier_Binary_Parallel.v">Source</a></b></p>
<p class="inline bordered"><b><a href="./legal.html">License</a></b></p>
<p class="inline bordered"><b><a href="./index.html">Index</a></b></p>

<h1>A Parallel Binary Multiplier</h1>
<p>A generic signed multiplier module, with the inference left to the CAD
 tool. Any attributes to control the synthesis of the multiplier should be
 applied, in the text of the enclosing module, to this whole module. The
 attributes vary too much and none provide the default automatic inference
 choices made by the CAD tool (e.g.: logic for narrow widths, DSP blocks for
 larger widths), which are almost always the right choices.</p>
<h2>Width and Signedness</h2>
<p>Both input word widths are parameterized separately so as to infer the
 smallest multiplier necessary to generate a full product, whose width is
 the sum of the widths of the inputs. <em>The user must manually supply that
 total width to the connecting wires in the enclosing module, as there is no
 way to introspect parameter values inside a module.</em></p>
<p>The multiplier inference is limited to signed integers as the common case
 and to keep the code simple. If you must treat the inputs as unsigned
 integers, first extend them with a constant zero most-significant bit to
 force them positive. Yes, this may cost some area and may slow down the
 logic slightly, but if a single extra bit of width breaks your timing
 closure, then your design's timing closure was already on its last legs,
 and perhaps you need to allow more pipelining and/or use a narrower width.</p>
<h2>Pipelining</h2>
<p>At the time of writing (March 2020), retiming of external registers into an
 inferred multiplier does not work in Vivado, so we cannot simply place
 a <a href="./Register_Pipeline.html">Register Pipeline</a> before and after the
 multiplier module to help it meet timing if necessary.</p>
<p>Instead, we must connect the input and output pipelines and the multiplier
 all together in a single clocked always block to match the recommended HDL
 style for multiplier inference (UG901, <em>Vivado Design Suite User Guide:
 Synthesis</em>). This code also works under Intel Quartus Prime as its HDL
 coding guidelines for inferring multipliers are the same (UG-20131, <em>Intel
 Quartus Prime Pro Edition User Guide: Design Recommendations</em>).</p>
<p><strong>Note</strong>: <em>you must disable shift register extraction during synthesis to
 force implementation of the pipelines as registers which can be retimed and
 can be placed-and-routed independently.</em> Else your pipeline may be
 implemented as shift registers which, while compact, won't provide any
 pipelining benefits. <em>Shift register extraction is enabled by default in
 Vivado and Quartus synthesis.</em></p>
<p>Pipelining multipliers, although optional, benefits both synthesis and
 place-and-route: </p>
<ul>
<li>At a minimum, single input and output pipeline registers will get packed
 into the input and output registers of the DSP blocks, easing timing and
 saving area.</li>
<li>Particularly for wider multipliers, the inferred adder tree which merges
 the partial products from multiple DSP blocks may need an output pipeline
 to meet timing. (The input pipeline does not appear to retime through the
 DSP blocks.)</li>
<li>Finally, since DSP blocks only exist in certain fixed locations on FPGAs,
 abundant pipelining frees the CAD tool to place-and-route the DSP blocks
 away from other timing-critical logic, which would otherwise misleadinly
 appear to be the critical path (due to poor routing) and lead you to waste
 effort trying to optimize the wrong logic!
 For example, I've once had to add a total of 24 pipeline stages (12 input,
 12 output) to a very wide multiplier with a 100-bit product to allow other
 100-bit arithmetic logic to meet timing.</li>
</ul>

<pre>
`default_nettype none

module <a href="./Multiplier_Binary_Parallel.html">Multiplier_Binary_Parallel</a>
#(
    parameter WORD_WIDTH_A      = 0,
    parameter WORD_WIDTH_B      = 0,
    parameter INPUT_PIPE_DEPTH  = 0,
    parameter OUTPUT_PIPE_DEPTH = 0,

    // Don't set at instantiation, except in IPI
    parameter PRODUCT_WIDTH = WORD_WIDTH_A + WORD_WIDTH_B
)
(
    // Unused if no input/output pipelines (combinational multiplier)
    // verilator lint_off UNUSED
    input  wire                             clock,
    // verilator lint_on  UNUSED
    input  wire signed [WORD_WIDTH_A-1:0]   A_in,
    input  wire signed [WORD_WIDTH_B-1:0]   B_in,
    output reg  signed [PRODUCT_WIDTH-1:0]  product_out
);

    localparam WORD_ZERO_A  = {WORD_WIDTH_A{1'b0}};
    localparam WORD_ZERO_B  = {WORD_WIDTH_B{1'b0}};
    localparam PRODUCT_ZERO = {PRODUCT_WIDTH{1'b0}};

    initial begin
        product_out = PRODUCT_ZERO;
    end
</pre>

<p>If their depth is greater than zero, create the input and/or output
 pipelines. These <strong>MUST</strong> be declared as <code>signed</code>, else the multiplication
 will be inferred as unsigned and calculate the wrong results when given
 negative integers.</p>
<p>Then, if necessary, we connect the inputs, the pipelines, and the
 multiplier all together in a single clocked always block. The CAD tool with
 infer DSP blocks and adders, and retime the pipeline registers as necessary
 if retiming is enabled in your CAD tool. <em>Retiming is off by default in
 Vivado and Quartus synthesis.</em></p>
<p>We write the pipelines using the idiom of peeling out the first loop
 iteration so we never generate a negative index with <code>i-1</code>. Note the
 initialization value of <code>i</code> in the for-loops.</p>
<p>There are four possible cases of zero and non-zero (positive) pipeline
 depths, so we generate the correct code, based on the HDL guidelines, for
 each case. A negative pipe depth would result in negative array ranges,
 <em>which is legal in Verilog-2001</em>, though I have no idea what it's for. To
 avoid strange corner cases, no code is generated for negative values, which
 will cause the design to fail to elaborate.</p>

<pre>
    generate

        // No pipelines (combinational multiplier)
        if ((INPUT_PIPE_DEPTH == 0) && (OUTPUT_PIPE_DEPTH == 0)) begin: no_pipe
            always @(*) begin
                product_out = A_in * B_in;
            end
        end

        // Input pipeline only
        else if ((INPUT_PIPE_DEPTH > 0) && (OUTPUT_PIPE_DEPTH == 0)) begin: in_pipe
            reg signed [WORD_WIDTH_A-1:0]  input_pipe_A  [INPUT_PIPE_DEPTH-1:0];
            reg signed [WORD_WIDTH_B-1:0]  input_pipe_B  [INPUT_PIPE_DEPTH-1:0];

            integer i;

            initial begin
                for (i=0; i < INPUT_PIPE_DEPTH; i=i+1) begin
                    input_pipe_A [i] = WORD_ZERO_A;
                    input_pipe_B [i] = WORD_ZERO_B;
                end
            end

            always @(posedge clock) begin
                input_pipe_A[0] <= A_in;
                input_pipe_B[0] <= B_in;

                for (i=1; i < INPUT_PIPE_DEPTH; i=i+1) begin: per_input
                    input_pipe_A [i] <= input_pipe_A [i-1];
                    input_pipe_B [i] <= input_pipe_B [i-1];
                end
            end

            // Corner case: it isn't possible to put this in the clocked
            // always block without registering the output as a consequence.
            always @(*) begin
                product_out = input_pipe_A [INPUT_PIPE_DEPTH-1] * input_pipe_B [INPUT_PIPE_DEPTH-1];
            end
        end

        // Output pipeline only
        else if ((INPUT_PIPE_DEPTH == 0) && (OUTPUT_PIPE_DEPTH > 0)) begin: out_pipe
            reg signed [PRODUCT_WIDTH-1:0] output_pipe   [OUTPUT_PIPE_DEPTH-1:0];

            integer i;

            initial begin
                for (i=0; i < OUTPUT_PIPE_DEPTH; i=i+1) begin
                    output_pipe [i] = PRODUCT_ZERO;
                end
            end

            always @(posedge clock) begin
                output_pipe [0] <= A_in * B_in;

                for (i=1; i < OUTPUT_PIPE_DEPTH; i=i+1) begin: per_output
                    output_pipe [i] <= output_pipe [i-1];
                end
            end

            always @(*) begin
                product_out = output_pipe [OUTPUT_PIPE_DEPTH-1];
            end
        end

        // Both input and output pipelines
        else if ((INPUT_PIPE_DEPTH > 0) && (OUTPUT_PIPE_DEPTH > 0)) begin: in_out_pipe
            reg signed [WORD_WIDTH_A-1:0]  input_pipe_A  [INPUT_PIPE_DEPTH-1:0];
            reg signed [WORD_WIDTH_B-1:0]  input_pipe_B  [INPUT_PIPE_DEPTH-1:0];
            reg signed [PRODUCT_WIDTH-1:0] output_pipe   [OUTPUT_PIPE_DEPTH-1:0];

            integer i;

            initial begin
                for (i=0; i < INPUT_PIPE_DEPTH; i=i+1) begin
                    input_pipe_A [i] = WORD_ZERO_A;
                    input_pipe_B [i] = WORD_ZERO_B;
                end
                for (i=0; i < OUTPUT_PIPE_DEPTH; i=i+1) begin
                    output_pipe [i] = PRODUCT_ZERO;
                end
            end

            always @(posedge clock) begin
                input_pipe_A[0] <= A_in;
                input_pipe_B[0] <= B_in;

                for (i=1; i < INPUT_PIPE_DEPTH; i=i+1) begin: per_input
                    input_pipe_A [i] <= input_pipe_A [i-1];
                    input_pipe_B [i] <= input_pipe_B [i-1];
                end

                output_pipe [0] <= input_pipe_A [INPUT_PIPE_DEPTH-1] * input_pipe_B [INPUT_PIPE_DEPTH-1];

                for (i=1; i < OUTPUT_PIPE_DEPTH; i=i+1) begin: per_output
                    output_pipe [i] <= output_pipe [i-1];
                end

            end

            always @(*) begin
                product_out = output_pipe [OUTPUT_PIPE_DEPTH-1];
            end
        end

    endgenerate

endmodule
</pre>

<hr>
<p><a href="./index.html">Back to FPGA Design Elements</a>
<center><a href="https://fpgacpu.ca/">fpgacpu.ca</a></center>
</body>
</html>

